# str_split comes from stringr; %>% from magrittr (both loaded via the tidyverse)
example_1 <- function(path) {
path_list <- str_split(path, "/") %>% unlist()
paste(path_list[length(path_list) - 1], path_list[length(path_list)], sep = "_")
}
example_1(c("foo/bar/car/001.txt", "har/far/lar/002.txt"))
[1] "lar_002.txt"
2023-06-28
Tags: `Rcpp`, `data.table`

"For loops in R are slow" - many StackOverflow posts
- In general, if you're using a `for` loop (or an `sapply` variant), then your code could be sped up by using a *vectorised* function
- A vectorised function satisfies `f(x[i]) = f(x)[i]` for \(i \in 1, ..., N\)
- `sqrt(c(4, 9, 16))` = `2, 3, 4`, therefore `sqrt` is vectorised

for (i in 1:nrow(df)) {
# New column based on 2 others
if (df$x[i] > 5 && df$y[i] < 0) {
df$y[i] <- 5
} else {
df$y[i] <- 0
}
# Replace NAs with error code
if (is.na(df$z[i])) {
df$z[i] <- 9999
}
# String concatenate columns
df$xy[i] <- paste(df$x[i], df$y[i], sep="_")
}
# Distance between every row
dists <- matrix(nrow=nrow(df), ncol=nrow(df))
for (i in 1:nrow(df)) {
for (j in 1:nrow(df)) {
if (j != i) {
dists[i, j] <- sqrt((df$x[i] - df$x[j])**2 + (df$y[i] - df$y[j])**2)
}
}
}

Given a path `some/path/abc/001.txt`, create a fast function to return `abc_001.txt`
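As an aside, the pairwise-distance double loop shown above can be replaced entirely by the built-in `dist()` function, which does all the looping in C. A hedged sketch on toy data (`df_toy` is a stand-in for the real `df`):

```r
# Toy data frame standing in for df
df_toy <- data.frame(x = c(0, 3, 6), y = c(0, 4, 8))

# dist() computes all pairwise Euclidean distances in C;
# as.matrix() converts the condensed result to the full symmetric matrix
dists <- as.matrix(dist(df_toy[, c("x", "y")]))
# symmetric 3x3 matrix: d(1,2) = 5, d(1,3) = 10, d(2,3) = 5
```

Unlike the manual loop, the diagonal is 0 rather than `NA`, but every off-diagonal entry matches.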
- `example_1` works by splitting on `/` and concatenating the last directory and filename

Vectorize
- `Vectorize` takes a function that works on a single element, and returns a vectorised version - job done!
- Except `Vectorize` uses `apply` under the hood and isn't quicker - it's mostly just syntactical sugar
- `str_split` is vectorised, returning a list over the input entries; `paste` is also vectorised
- Can loop (e.g. with `sapply`) to grab the last dir and filename from each entry
- `dirname('foo/bar/dog.txt')` = `foo/bar`; `basename('foo/bar/dog.txt')` = `dog.txt`
- The `microbenchmark` library makes it easy to time snippets of code
- The `Vectorize` version isn't doing anything different from manually looping with `sapply`

library(microbenchmark)
# Construct 100 paths
paths <- rep(c("some/path/abc/001.txt", "another/directory/xyz/002.txt"), 100)
res <- microbenchmark(
example_1_vectorised(paths),
sapply(paths, example_1),
example_2(paths),
example_3(paths)
)
summary(res)[c("expr", "median")]
                         expr   median
1 example_1_vectorised(paths) 7868.738
2 sapply(paths, example_1) 7896.205
3 example_2(paths) 1416.520
4 example_3(paths) 139.350
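The definitions of `example_2` and `example_3` aren't reproduced in this extract; based on the bullets above, hypothetical reconstructions (the `_sketch` names are mine) might look like the following - `example_2` looping with `sapply` over the C-backed `dirname`/`basename`, `example_3` fully vectorised:

```r
library(stringr)   # str_split
library(magrittr)  # %>%

example_1 <- function(path) {
  path_list <- str_split(path, "/") %>% unlist()
  paste(path_list[length(path_list) - 1], path_list[length(path_list)], sep = "_")
}

# Presumably how example_1_vectorised was built
example_1_vectorised <- Vectorize(example_1)

# Hypothetical example_2: loop with sapply, using the C-backed dirname/basename
example_2_sketch <- function(paths) {
  sapply(paths, function(p) paste(basename(dirname(p)), basename(p), sep = "_"),
         USE.NAMES = FALSE)
}

# Hypothetical example_3: fully vectorised - dirname/basename accept vectors
example_3_sketch <- function(paths) {
  paste(basename(dirname(paths)), basename(paths), sep = "_")
}

example_3_sketch(c("foo/bar/car/001.txt", "har/far/lar/002.txt"))
# [1] "car_001.txt" "lar_002.txt"
```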
> In general, if you're using a `for` loop (or an `sapply` variant), then your code could be sped up by using a *vectorised* function - Me (7 slides ago)
- The difference between `Vectorize` and an inbuilt function like `basename` is that the latter will also have a for loop, but it will be written in C/C++ rather than R

> In general, if you're using a `for` loop (or an `sapply` variant), then your code could be sped up by using a for loop written in C/C++, preferably part of the standard library - Me (now)
- Working with data.frames and the tidyverse core verbs pushes you towards using vectorised functions
- `group_by() |> summarise()` is both quicker and more legible than manually looping over the groups and combining the results
- `filter(f(x))` assumes that `f()` is vectorised and returns a boolean TRUE/FALSE for every row
- `mutate(newcol = f(oldcol))` assumes `f()` is vectorised and returns a value per row

# Non-vectorised version didn't Error,
# but gave an unexpected result
data.frame(path=paths) |>
mutate(path_clean1 = example_1(path),
path_clean2 = example_3(path)) |>
head()
                           path path_clean1 path_clean2
1 some/path/abc/001.txt xyz_002.txt abc_001.txt
2 another/directory/xyz/002.txt xyz_002.txt xyz_002.txt
3 some/path/abc/001.txt xyz_002.txt abc_001.txt
4 another/directory/xyz/002.txt xyz_002.txt xyz_002.txt
5 some/path/abc/001.txt xyz_002.txt abc_001.txt
6 another/directory/xyz/002.txt xyz_002.txt xyz_002.txt
# This function isn't vectorised due to the if/else statement
# Solution: Use ifelse() instead
replace_both_NA_9999 <- function(x, y) {
if (is.na(x) && is.na(y)) {
return(9999)
} else {
return(0)
}
}
data.frame(a=c(5, 3, NA, 2, NA),
b=c(NA, 2, NA, 1, 9)) |>
mutate(c = replace_both_NA_9999(a, b))

Error in `mutate()`:
ℹ In argument: `c = replace_both_NA_9999(a, b)`.
Caused by error in `is.na(x) && is.na(y)`:
! 'length = 5' in coercion to 'logical(1)'
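Following the "use `ifelse()` instead" suggestion in the comment above, a vectorised rewrite (the `_vec` name is mine) might look like:

```r
library(dplyr)

# Vectorised rewrite: ifelse() evaluates the condition element-wise,
# and & (element-wise AND) replaces && (scalar AND)
replace_both_NA_9999_vec <- function(x, y) {
  ifelse(is.na(x) & is.na(y), 9999, 0)
}

data.frame(a = c(5, 3, NA, 2, NA),
           b = c(NA, 2, NA, 1, 9)) |>
  mutate(c = replace_both_NA_9999_vec(a, b))
#    a  b    c
# 1  5 NA    0
# 2  3  2    0
# 3 NA NA 9999
# 4  2  1    0
# 5 NA  9    0
```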
- The `join` family of functions is an integral part of data analysis
- `join` functions are highly efficient and can be useful in a number of situations, even when we don't have 2 separate datasets
- `inner_join` links two dataframes together based on a column in common, with the number of rows equal to the number of rows in the 'left' table that have a matching row in the 'right' table
- Will compare against a nested `ifelse` approach
- Can think of `inner_join` as being able to both filter and mutate new columns
- The `join` isn't much quicker, but it's far more legible and scales well to both having more groups in the joining column, and additional joining columns

| group | time | value |
|---|---|---|
| a | 2020-03-05 00:00:00 | 0.7225352 |
| a | 2020-03-05 00:01:00 | 0.9641624 |
| a | 2020-03-05 00:02:00 | 0.3115964 |
| a | 2020-03-05 00:03:00 | -0.0363529 |
| a | 2020-03-05 00:04:00 | -0.6124932 |
| group | scale |
|---|---|
| a | 2.0 |
| b | 7.8 |
| c | 9.0 |
| group | time | value | scale |
|---|---|---|---|
| a | 2020-03-05 00:00:00 | 0.7225352 | 2 |
| a | 2020-03-05 00:01:00 | 0.9641624 | 2 |
| a | 2020-03-05 00:02:00 | 0.3115964 | 2 |
| a | 2020-03-05 00:03:00 | -0.0363529 | 2 |
| a | 2020-03-05 00:04:00 | -0.6124932 | 2 |
f_join <- function() {
df |> inner_join(scales, by="group")
}
f_ifelse <- function() {
df |>
mutate(scale = ifelse(group == 'a', 2,
ifelse(group == 'b', 7.8,
ifelse(group == 'c', 9, NA))))
}
res <- microbenchmark(f_join(), f_ifelse(), times=10)
summary(res)[c("expr", "median")]
        expr   median
1 f_join() 22.45482
2 f_ifelse() 27.31786
left_join
- `left_join` returns all rows in the left table, but only those in the right that match the condition; rows without a match are filled with `NA`

| time | measurement |
|---|---|
| 2020-01-01 | 2.9016961 |
| 2020-01-02 | 0.0465646 |
| 2020-01-03 | -1.5239673 |
| 2020-01-04 | 0.2516128 |
| 2020-01-05 | 0.1280887 |
| 2020-01-06 | -2.8242639 |
| 2020-01-07 | -1.4247783 |
| 2020-01-08 | 1.4217469 |
| 2020-01-09 | -1.5889684 |
| 2020-01-10 | -0.7023941 |
| week_group | week_start | week_end |
|---|---|---|
| a | 2020-02-14 | 2020-02-21 |
| b | 2020-03-17 | 2020-03-24 |
| c | 2020-05-08 | 2020-05-15 |
| d | 2020-09-20 | 2020-09-27 |
| e | 2020-11-13 | 2020-11-20 |
| time | measurement | week_group | week_start | week_end |
|---|---|---|---|---|
| 2020-02-14 | -0.2035220 | a | 2020-02-14 | 2020-02-21 |
| 2020-02-15 | -0.2149556 | a | 2020-02-14 | 2020-02-21 |
| 2020-02-16 | 0.3451100 | a | 2020-02-14 | 2020-02-21 |
| 2020-02-17 | -0.4586642 | a | 2020-02-14 | 2020-02-21 |
| 2020-02-18 | -0.6964497 | a | 2020-02-14 | 2020-02-21 |
| 2020-02-19 | 2.4552832 | a | 2020-02-14 | 2020-02-21 |
| 2020-02-20 | 0.7747280 | a | 2020-02-14 | 2020-02-21 |
| 2020-03-17 | -0.6530212 | b | 2020-03-17 | 2020-03-24 |
| 2020-03-18 | -1.3282548 | b | 2020-03-17 | 2020-03-24 |
| 2020-03-19 | 0.7340179 | b | 2020-03-17 | 2020-03-24 |
f_intervaljoin <- function() {
df_interval |>
inner_join(weeks, by=join_by(time >= week_start, time < week_end))
}
f_ifelse <- function() {
df_interval |>
mutate(week_group = ifelse(time >= as_date("2020-02-14") & time < as_date("2020-02-21"),
'a',
ifelse(time >= as_date("2020-03-17") & time < as_date("2020-03-24"),
'b',
ifelse(time >= as_date("2020-05-08") & time < as_date("2020-05-15"),
'c',
ifelse(time >= as_date("2020-09-20") & time < as_date("2020-09-27"),
'd',
ifelse(time >= as_date("2020-11-13") & time < as_date("2020-11-20"),
'e',
NA)))))) |>
filter(!is.na(week_group))
}
res <- microbenchmark(f_intervaljoin(), f_ifelse(), times=10)
summary(res)[c("expr", "median")]
              expr    median
1 f_intervaljoin() 2.254611
2 f_ifelse() 27.822321
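As a minimal illustration of the `left_join` behaviour described earlier (unmatched left rows are kept and filled with `NA`), on toy data of my own:

```r
library(dplyr)

orders <- data.frame(id = c(1, 2, 3))
prices <- data.frame(id = c(1, 3), price = c(10, 30))

# All three order rows survive; id 2 has no price, so gets NA
left_join(orders, prices, by = "id")
#   id price
# 1  1    10
# 2  2    NA
# 3  3    30
```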
df <- read_csv("BasicCompanyDataAsOneFile-2023-05-01.csv", n_max=1e6, show_col_types=FALSE)
df$IncorporationDate <- as_date(df$IncorporationDate, format="%d/%m/%Y")
dim(df)
[1] 1000000      55
# A tibble: 6 × 4
CompanyName RegAddress.PostTown IncorporationDate SICCode.SicText_1
<chr> <chr> <date> <chr>
1 ! HEAL UR TECH LTD GUILDFORD 2022-10-12 33140 - Repair o…
2 ! LTD LEEDS 2012-09-11 99999 - Dormant …
3 !? LTD ROMILEY 2018-06-05 47710 - Retail s…
4 !BIG IMPACT GRAPHICS … LONDON 2018-12-28 18129 - Printing…
5 !GOBERUB LTD BISHOP'S STORTFORD 2021-05-17 62020 - Informat…
6 !NFOGENIE LTD LONDON 2021-07-21 58290 - Other so…
- Example tasks: count York companies by their `YOXX YYY` postcode format; find the average number of SIC classifications per company for common SIC codes (using an `inner_join` to filter)

# A tibble: 6 × 5
CompanyName SICCode.SicText_1 SICCode.SicText_2 SICCode.SicText_3
<chr> <chr> <chr> <chr>
1 ! HEAL UR TECH LTD 33140 - Repair o… 47421 - Retail s… <NA>
2 ! LTD 99999 - Dormant … <NA> <NA>
3 !? LTD 47710 - Retail s… <NA> <NA>
4 !BIG IMPACT GRAPHICS LI… 18129 - Printing… 59112 - Video pr… 63120 - Web port…
5 !GOBERUB LTD 62020 - Informat… 70229 - Manageme… 79110 - Travel a…
6 !NFOGENIE LTD 58290 - Other so… <NA> <NA>
# ℹ 1 more variable: SICCode.SicText_4 <chr>
# 755 rows containing the SIC codes that at least 10 companies have
# Only 1 column, SICCode.SicText_1
sic_10_companies <- df |>
count(SICCode.SicText_1) |>
filter(n >= 10) |>
select(SICCode.SicText_1)
df |>
# Could do a filter to restrict to these 10 companies, but it's actually quicker to use an inner join
inner_join(sic_10_companies, by="SICCode.SicText_1") |>
select(CompanyNumber, SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4) |>
mutate(first_classification = SICCode.SicText_1) |>
# Pivoting to make it easier to count how many non-NULL classifications each company has
pivot_longer(c(SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4)) |>
filter(!is.na(value)) |>
# Count how many classifications each company has
count(CompanyNumber, first_classification) |>
# Calculate the average number per the first classification
group_by(first_classification) |>
summarise(mean_classifications = mean(n, na.rm=T)) |>
arrange(desc(mean_classifications)) |>
head()

# A tibble: 6 × 2
first_classification mean_classifications
<chr> <dbl>
1 7210 - Hardware consultancy 3
2 07100 - Mining of iron ores 2.48
3 10611 - Grain milling 2.43
4 18110 - Printing of newspapers 2.43
5 14131 - Manufacture of other men's outerwear 2.40
6 01280 - Growing of spices, aromatic, drug and pharmaceut… 2.39
- `data.table` is an alternative to `data.frame`/`tibble` that is optimised for speed and low memory usage

library(data.table)
dt <- fread("BasicCompanyDataAsOneFile-2023-05-01.csv", nrows=1e6) # fread is the equivalent of read.csv
dt[, IncorporationDate := as_date(IncorporationDate, format="%d/%m/%Y") ] # Creates a new column by *reference*
dim(dt)
[1] 1000000      55
# Display rows 1-5 and the specified columns
dt[1:5, .(CompanyName, RegAddress.PostTown, IncorporationDate, SICCode.SicText_1)]
                    CompanyName RegAddress.PostTown IncorporationDate
1: ! HEAL UR TECH LTD GUILDFORD 2022-10-12
2: ! LTD LEEDS 2012-09-11
3: !? LTD ROMILEY 2018-06-05
4: !BIG IMPACT GRAPHICS LIMITED LONDON 2018-12-28
5: !GOBERUB LTD BISHOP'S STORTFORD 2021-05-17
SICCode.SicText_1
1: 33140 - Repair of electrical equipment
2: 99999 - Dormant Company
3: 47710 - Retail sale of clothing in specialised stores
4: 18129 - Printing n.e.c.
5: 62020 - Information technology consultancy activities
- `dt[i, j, k]` means: for data table `dt`, filter on rows `i`, create and/or select columns `j`, and group by `k`
- `data.table` operations don't use the pipe (`|>` or `%>%`), so either chain together `[]` calls or create intermediate variables
- `data.table`s have `data.frame` as a class, so standard functions work on them - they just won't benefit from the speed-up
- `.N` is the equivalent of `count`
- `.SD` applies an operation to a subset of columns (all by default)
- `x[y]` is equal to `left_join(y, x)`, NOT `inner_join(x, y)`
- `melt` is equivalent to `pivot_longer` and IMO less intuitive

sic_10_companies_dt <- dt[, .N, by=.(SICCode.SicText_1)][ N >= 10, .(SICCode.SicText_1) ]
dt_companies_wide <- dt[ sic_10_companies_dt, # This is a join!
.(CompanyNumber,
first_classification = SICCode.SicText_1,
SICCode.SicText_1,
SICCode.SicText_2,
SICCode.SicText_3,
SICCode.SicText_4),
on=.(SICCode.SicText_1)]
dt_companies_long <- melt(dt_companies_wide, id.vars=c('CompanyNumber', 'first_classification'))
dt_companies_mean <- dt_companies_long[ value != '', # Removes the unused SIC columns
.N,
by=.(CompanyNumber, first_classification)][,
.(mean_classifications = mean(N, na.rm=T)),
by=.(first_classification)]
head(dt_companies_mean[ order(mean_classifications, decreasing = TRUE)])
                                                  first_classification
1: 7210 - Hardware consultancy
2: 07100 - Mining of iron ores
3: 18110 - Printing of newspapers
4: 10611 - Grain milling
5: 14131 - Manufacture of other men's outerwear
6: 01280 - Growing of spices, aromatic, drug and pharmaceutical crops
mean_classifications
1: 3.000000
2: 2.478261
3: 2.428571
4: 2.428571
5: 2.401606
6: 2.392157
tidytable and dtplyr

tidytable: introduction
- `tidytable` is a drop-in replacement for common tidyverse functions that, under the hood, work on a `data.table` object
- The speed of `data.table` but the user-friendly API of the tidyverse
- After loading, `mutate`, `inner_join`, `count`, `select`, `filter` etc… will use the `tidytable` versions that work on a `data.table`
- Covers `dplyr`, `tidyr`, and `purrr` functions
- Changes in the `tidyverse` are reflected in `tidytable`

library(tidytable)
# Here we explicitly create tidytable from a regular data.frame
# But passing a regular data.frame or data.table into any tidytable function
# will implicitly change it to be a tidytable object
dtt <- as_tidytable(df)
dtt |>
count(SICCode.SicText_1) |>
filter(n >= 10) |>
select(SICCode.SicText_1)

# A tidytable: 755 × 1
SICCode.SicText_1
<chr>
1 01110 - Growing of cereals (except rice), leguminous crops and oil seeds
2 01120 - Growing of rice
3 01130 - Growing of vegetables and melons, roots and tubers
4 01160 - Growing of fibre crops
5 01190 - Growing of other non-perennial crops
6 01210 - Growing of grapes
7 01220 - Growing of tropical and subtropical fruits
8 01240 - Growing of pome fruits and stone fruits
9 01250 - Growing of other tree and bush fruits and nuts
10 01270 - Growing of beverage crops
# ℹ 745 more rows
dtplyr: introduction
- Another `data.table` wrapper is `dtplyr` (developed by the RStudio team)
- It takes a different approach to `tidytable`: it sequentially builds up the equivalent `data.table` query, but only executes the code when you explicitly request it (using `collect()` or `as.data.frame/table()`)

library(dtplyr)
# dtplyr operates on `lazy data.tables` which are only created by this function
dtp <- lazy_dt(df)
dtp |>
count(SICCode.SicText_1) |>
filter(n >= 10) |>
select(SICCode.SicText_1)

Source: local data table [755 x 1]
Call: `_DT1`[, .(n = .N), keyby = .(SICCode.SicText_1)][n >= 10, .(SICCode.SicText_1)]
SICCode.SicText_1
<chr>
1 01110 - Growing of cereals (except rice), leguminous crops and oil seeds
2 01120 - Growing of rice
3 01130 - Growing of vegetables and melons, roots and tubers
4 01160 - Growing of fibre crops
5 01190 - Growing of other non-perennial crops
6 01210 - Growing of grapes
# ℹ 749 more rows
# Use as.data.table()/as.data.frame()/as_tibble() to access results
dtplyr: usage
- Printing the lazy table shows the generated `data.table` query (subtly different to the one I manually wrote)
- Call `collect()` to execute it and return a tibble

dtp |>
count(SICCode.SicText_1) |>
filter(n >= 10) |>
select(SICCode.SicText_1) |>
collect() |>
head()

# A tibble: 6 × 1
SICCode.SicText_1
<chr>
1 01110 - Growing of cereals (except rice), leguminous crops and oil seeds
2 01120 - Growing of rice
3 01130 - Growing of vegetables and melons, roots and tubers
4 01160 - Growing of fibre crops
5 01190 - Growing of other non-perennial crops
6 01210 - Growing of grapes
dtplyr: chaining queries
- `dtplyr` queries that haven't been `collect()`ed can be used in joins

# NB: this returns a datatable QUERY, not a dataset itself
sic_10_companies_dtp <- dtp |>
count(SICCode.SicText_1) |>
filter(n >= 10) |>
select(SICCode.SicText_1)
# Can join that query into the middle of another query to return another query
results_dtp <- dtp |>
inner_join(sic_10_companies_dtp, by="SICCode.SicText_1") |>
select(CompanyNumber, SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4) |>
mutate(first_classification = SICCode.SicText_1) |>
pivot_longer(c(SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4)) |>
filter(!is.na(value)) |>
count(CompanyNumber, first_classification) |>
group_by(first_classification) |>
summarise(mean_classifications = mean(n, na.rm=T)) |>
arrange(desc(mean_classifications))
# Finally execute the full query
results_dtp |>
collect() |>
head()

# A tibble: 6 × 2
first_classification mean_classifications
<chr> <dbl>
1 7210 - Hardware consultancy 3
2 07100 - Mining of iron ores 2.48
3 10611 - Grain milling 2.43
4 18110 - Printing of newspapers 2.43
5 14131 - Manufacture of other men's outerwear 2.40
6 01280 - Growing of spices, aromatic, drug and pharmaceut… 2.39
- `SQLite` (designed for 'traditional' DB applications) and `duckdb` (optimised for analysis)
- Can keep using tidyverse functions, just like `dtplyr`, rather than learn a new language
- Connect with `dbConnect()` from `library(DBI)`
- `dbWriteTable()` will write a dataframe to the DB
- Usage is the same as `dtplyr`, except the source data comes from `tbl`

# tbl requires a connection and a table name to read data from
tbl(con_sql, "data_1e6") |>
filter(RegAddress.PostTown == 'YORK') |>
mutate(postcode = word(RegAddress.PostCode, 1)) |>
count(postcode) |>
arrange(desc(n)) |>
head(5) |>
show_query()

<SQL>
SELECT `postcode`, COUNT(*) AS `n`
FROM (
SELECT *, word(`RegAddress.PostCode`, 1.0) AS `postcode`
FROM `data_1e6`
WHERE (`RegAddress.PostTown` = 'YORK')
)
GROUP BY `postcode`
ORDER BY `n` DESC
LIMIT 5
- `word` hasn't been translated into SQL yet, so use `substr` from base, which has been translated

duckdb: introduction
- Same `dbConnect()` function, but passing in a different driver

duckdb: usage

tbl(con_dd, "data_1e6") |>
filter(RegAddress.PostTown == 'YORK') |>
mutate(postcode = substr(RegAddress.PostCode, 1, 4)) |>
count(postcode) |>
arrange(desc(n)) |>
head(50) |>
show_query()

<SQL>
SELECT postcode, COUNT(*) AS n
FROM (
SELECT *, SUBSTR("RegAddress.PostCode", 1, 4) AS postcode
FROM data_1e6
WHERE ("RegAddress.PostTown" = 'YORK')
) q01
GROUP BY postcode
ORDER BY n DESC
LIMIT 50
- `word` is also not ported to `duckdb`, so again use the `substr` version
- Otherwise the code is the same as with `SQLite` and `dtplyr`

tbl(con_dd, "data_1e6") |>
filter(RegAddress.PostTown == 'YORK') |>
mutate(postcode = substr(RegAddress.PostCode, 1, 4)) |>
count(postcode) |>
arrange(desc(n)) |>
head(50) |>
collect()

# A tibble: 23 × 2
postcode n
<chr> <dbl>
1 "YO30" 486
2 "YO19" 325
3 "YO26" 271
4 "YO1 " 241
5 "YO31" 180
6 "YO24" 178
7 "YO10" 172
8 "YO32" 169
9 "YO23" 161
10 "YO42" 142
# ℹ 13 more rows
- `data.table` is the fastest! But it requires learning a new 'language'
- `tidytable` and `dtplyr` wrap `data.table` in the tidyverse and let you use the same code
- `tidytable` is my personal sweet spot between ease of use and performance gains
- `duckdb` and `sqlite` are also useful when data storage is a concern
Rcpp, which combines R and C++, to the rescue!

Rcpp makes this process easy by providing:
- `Rcpp::cppFunction()` to write a C++ function as a string, or `Rcpp::sourceCpp()` if it's in a separate file

library(Rcpp)
cppFunction("
double sumRcpp(NumericVector x) {
// The function definition syntax is:
// RETURN_TYPE functionName(INPUT_TYPE input, ...)
int n = x.size(); // R objects have their own type (NumericVector) with useful attributes
double total = 0; // Need to instantiate variables before use
for (int i = 0; i < n; i++) { // C++ indexes start at 0
total += x[i];
}
return total; // Need to explicitly return values
}")
# The sumRcpp function is instantly available within R
sumRcpp(c(1, 2, 3))
[1] 6
- `NumericVector` is a special Rcpp data structure that represents a vector of floats, so will accept both `c(1.2, 2.4, 3.6)` and `c(1, 2, 3)`, but not `c('a', 'b', 'c')`
- `IntegerVector` coerces floats into integers
- `CharacterVector` also exists for strings, and `NumericMatrix` for 2D structures
- The standard C++ types (`int`, `double`, `char` etc…) are available too
- `wrap()` is an Rcpp function that converts back from C++ to R, useful when returning at the end of a function!
- Rcpp sugar provides vectorised versions of familiar R functions such as `mean`, `log`, `exp`, `sin`, `any`, `all`

sum benchmarks

kf_r <- function(y, m, Q=0.5, H=0.5) {
n <- length(y)
alpha <- array(NA, dim=c(n+1, m))
P <- array(NA, dim=c(m, m, n+1))
alpha[1, ] <- 0 # Initialise with zero mean and high variance
P[, , 1] <- 1e3
Z <- array(1, dim=c(n, m))
for (i in 1:n) {
P_updated <- P[, , i] + Q
# Calculate kalman gain
K <- P_updated %*% t(Z[i, ]) %*% solve(Z[i, ] %*% P_updated %*% t(Z[i, ]) + H)
# Update state and covariance
alpha[i+1, ] <- alpha[i, ] + K %*% (y[i] - Z[i, ] %*% alpha[i, ])
P[, , i+1] <- (diag(m) - K %*% Z[i, ]) %*% P_updated
}
list(alpha=alpha, P=P)
}

cppFunction("
NumericVector kf_rcpp(arma::vec y, int m, float Q=0.5, float H=0.5) {
int n = y.n_rows;
arma::mat alpha(n+1, m, arma::fill::none);
arma::cube P(m, m, n+1, arma::fill::none);
arma::mat Z(n, m, arma::fill::ones);
// Initialise with zero mean and high variance
alpha.row(0).fill(0);
P.slice(0).diag().fill(1000);
// Run filter
arma::mat P_updated(m, m);
arma::mat K(m, m);
for (int i=0; i<n; i++) {
P_updated = P.slice(i) + Q;
// Calculate kalman gain:
K = P_updated * Z.row(i).t() * arma::inv(Z.row(i) * P_updated * Z.row(i).t() + H);
// Update state and covariance
alpha.row(i+1) = alpha.row(i) + K * (y[i] - Z.row(i) * alpha.row(i));
P.slice(i+1) = (arma::eye(m, m) - K * Z.row(i)) * P_updated;
}
return wrap(alpha); // This is crucial, converts the Armadillo matrix into an R NumericVector
}", depends="RcppArmadillo")

- Uses the `RcppArmadillo` library for access to 3D arrays
- Can then call `kf_r()` or `kf_rcpp()` identically

Parallelisation
- `parallel::mclapply` (Linux), `doParallel` and `foreach` (Windows), or `furrr` (all OSes; tidyverse style, combines `future` and `purrr`)
- `future.batchtools` offers the ability to automatically create and submit Slurm job scripts from within R, allowing you to easily switch between running sequentially, local multi-core parallelisation, and independent Slurm array jobs. UNTESTED

How often do I use these?
- `data.table` in some form - weekly
- `Rcpp` - a couple of times a year
- `duckdb` in future for some large datasets

Resources
- `data.table`: vignette, syntax comparison with tidyverse
- `tidytable` vs `dtplyr`: benchmarking between `tidytable`, `data.table`, `dtplyr`, the tidyverse, and pandas (nb: from the `tidytable` author)
- `duckdb`: official docs
- `Rcpp`: chapter in Advanced R, the Rcpp book (thorough), the Rcpp for everyone book (accessible)

The `basename` and `dirname` solution was faster than regex:

example_4 <- function(paths) {
gsub(".+\\/+([[:alnum:]]+)\\/([[:alnum:]]+\\.[[:alnum:]]+)$", "\\1_\\2", paths)
}
example_4(c("foo/bar/car/001.txt", "har/far/lar/002.txt"))
[1] "car_001.txt" "lar_002.txt"
expr median
1 example_1_vectorised(paths) 6973.2335
2 sapply(paths, example_1) 6793.3205
3 example_2(paths) 1232.2500
4 example_3(paths) 125.4765
5 example_4(paths) 387.4380
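The slides list `parallel::mclapply`, `foreach`, and `furrr` without an example; a minimal `mclapply` sketch (forked workers, so Linux/macOS only - on Windows `mc.cores` must be 1) could look like:

```r
library(parallel)

slow_square <- function(x) {
  Sys.sleep(0.1)  # stand-in for an expensive computation
  x^2
}

# Each element of the input is handled independently on a forked worker
res <- mclapply(1:8, slow_square, mc.cores = 2)
unlist(res)
# [1]  1  4  9 16 25 36 49 64
```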
ifelse and case_when

f_casewhen <- function() {
df_interval |>
mutate(week_group = case_when(
time >= as_date("2020-02-14") & time < as_date("2020-02-21") ~ 'a',
time >= as_date("2020-03-17") & time < as_date("2020-03-24") ~ 'b',
time >= as_date("2020-05-08") & time < as_date("2020-05-15") ~ 'c',
time >= as_date("2020-09-20") & time < as_date("2020-09-27") ~ 'd',
time >= as_date("2020-11-13") & time < as_date("2020-11-20") ~ 'e',
.default = NA_character_
)) |>
filter(!is.na(week_group))
}
res <- microbenchmark(f_intervaljoin(), f_ifelse(), f_casewhen(), times=10)
summary(res)[c("expr", "median")]
              expr    median
1 f_intervaljoin() 1.930972
2 f_ifelse() 22.516736
3 f_casewhen() 22.306646
It's quicker to use `inner_join` than `filter`:

           expr   median
1 f_filter() 898.7851
2 f_inner_join() 800.7263
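The `f_filter`/`f_inner_join` pair behind this benchmark isn't shown in this extract; hypothetical versions consistent with the earlier pipeline (shown here on toy stand-ins for the real `df`/`sic_10_companies`, with `_sketch` names of my own) might be:

```r
library(dplyr)

# Toy stand-ins for the real df / sic_10_companies objects
df_toy   <- data.frame(id = 1:5, sic = c("A", "A", "B", "C", "A"))
keep_toy <- data.frame(sic = c("A", "B"))

# Approach 1: filter with %in%
f_filter_sketch <- function() {
  df_toy |> filter(sic %in% keep_toy$sic)
}

# Approach 2: inner_join against the table of codes to keep
f_inner_join_sketch <- function() {
  df_toy |> inner_join(keep_toy, by = "sic")
}

nrow(f_filter_sketch())      # 4
nrow(f_inner_join_sketch())  # 4
```

Both restrict to the same rows; the benchmark above suggests the join version edges out the filter on the full dataset.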